Abstract

We investigate the generalization ability of a perceptron with a non-monotonic
transfer function of the reversed-wedge type in the on-line mode. This network
is identical to a parity machine, a multilayer network. We consider several
learning algorithms. For the perceptron algorithm, the generalization error is
shown to decrease by the $\alpha^{-1/3}$ law, as in the case of a simple perceptron,
in a restricted range of the parameter $a$ characterizing the non-monotonic
transfer function. For other values of $a$, the perceptron algorithm leads to
the state where the weight vector of the student is exactly opposite to that of
the teacher. The Hebbian learning algorithm has a similar property; it works
only in a limited range of the parameter. The conventional AdaTron algorithm
does not give a vanishing generalization error for any value of $a$. We thus
introduce a modified AdaTron algorithm which yields good performance for
all values of $a$. We also investigate the effects of optimization of the learning
rate as well as of the learning algorithm. Both methods give excellent learning
curves proportional to $\alpha^{-1}$. The latter optimization is related to Bayes
statistics and is shown to yield useful hints on how to extract the maximum amount of
information necessary to accelerate learning processes.
PACS numbers: 87.10.+e 



I. INTRODUCTION 



In artificial neural networks, the issue of learning from examples has been one of the most
attractive problems. Traditionally, emphasis has been put on off-line (or batch)
learning. In the off-line learning scenario, the student sees a set of examples (called a training
set) repeatedly until equilibrium is reached. This learning scenario can be analyzed in the
framework of equilibrium statistical mechanics based on the energy cost function, which
is the student's total error for a training set, or on other types of cost functions.
However, recently, several important features of learning from examples were derived from
the paradigm of on-line learning. In the on-line learning scenario, the student sees each
example only once, throws it out, and never sees it again. In other words, at each
learning stage, the student receives a randomly drawn example and is not able to memorize
it. The most recent example is used for modifying the student weight vector only by a small
amount. On-line learning has an advantage over its off-line counterpart in that it explicitly
carries information about the current stage of achievement of the student as a function of
the training time (which is proportional to the number of examples).

During the last several years, many interesting results have been reported in relation to
on-line learning. Among them, the generalization ability of multilayer networks is one of
the central problems. Multilayer neural networks are much more powerful machines
for information representation than the simple perceptron.

Recently, the properties of neural networks with a non-monotonic transfer function have
also been investigated by several authors [11]-[16]. A perceptron with a non-monotonic
transfer function has the same input-output relations as a multilayer neural network called
the parity machine. This parity machine has one hidden layer composed of three hidden
units (the $K = 3$ parity machine). The outputs of the hidden units are sgn($-u$),
sgn($-a - u$) and sgn($a - u$), where $u = \sqrt{N}(J \cdot x)/|J|$. Here $J$ is the $N$-dimensional synaptic
connection vector and $x$ denotes the input signal. The final output of this machine
is then given as the product sgn($-u$) sgn($-a - u$) sgn($a - u$). We regard this final output of
the $K = 3$ parity machine as the output of a perceptron with a non-monotonic transfer
function. Recently, Engel and Reimers investigated the generalization ability of this
non-monotonic perceptron following the off-line learning scenario. Their results are summarized
as follows: for $0 < a < \infty$, there exists a poor generalization phase with a large generalization
error. As the number of presented patterns increases, a good generalization phase appears
after a first-order phase transition at some $\alpha$. No studies have been made of the present
system following the on-line learning scenario. In this paper we study the on-line learning
process and the generalization ability of this non-monotonic perceptron for various learning
algorithms.

This paper is organized as follows. In the next section we introduce our model system
and derive the dynamical equations with respect to two order parameters for a general
learning algorithm. One is the overlap between the teacher and student weight vectors and the
other is the length of the student weight vector. In Sec. III, we investigate the dynamics of
on-line learning in the non-monotonic perceptron for the conventional perceptron learning
and Hebbian learning algorithms. We also investigate the asymptotic form of the differential
equations in both the small and large $\alpha$ limits and obtain the asymptotic behavior of the
generalization error. In Sec. IV we investigate the AdaTron learning algorithm and modify the
conventional AdaTron algorithm. In this modification procedure, we improve the weight
function of the AdaTron learning so as to adapt it to the range of $a$. In Sec. V,
we optimize the learning rate and the general weight function appearing in the on-line
dynamics. As the weight function contains variables unknown to the student, we average
over these variables using the Bayes formula. Section
VI contains concluding remarks.

II. THE MODEL SYSTEM AND DYNAMICAL EQUATIONS 

We investigate the generalization ability of the non-monotonic perceptron for various
learning algorithms. The student and teacher perceptrons are characterized by their weight
vectors, namely $J \in \mathbb{R}^N$ and $B \in \mathbb{R}^N$ with $|B| = 1$, respectively. For a binary input signal
$x \in \{-1, +1\}^N$, the output is calculated by the non-monotonic transfer function as follows:

    T_a(v) = \mathrm{sign}[v (a - v)(a + v)]    (1)

for the teacher and

    S_a(u) = \mathrm{sign}[u (a - u)(a + u)]    (2)

for the student, where we define the local fields of the teacher and student as
$v = \sqrt{N}(B \cdot x)/|B|$ and $u = \sqrt{N}(J \cdot x)/|J|$, respectively. The on-line learning dynamics is
defined by the following general rule for the change of the student vector under presentation
of the $m$th example:

    J^{m+1} = J^m + f(T_a(v), u)\, x.    (3)

Well-known examples are the perceptron learning, $f = -S_a(u)\,\Theta(-T_a(v)S_a(u))$, the Hebbian
learning, $f = T_a(v)$, and the AdaTron learning, $f = -u\,l\,\Theta(-T_a(v)S_a(u))$.
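The transfer function and the three weight functions listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the region-by-region form of the transfer function, and the convention chosen at the boundaries $x = 0, \pm a$ are assumptions made here, not part of the original formulation.

```python
import math

def reversed_wedge(x: float, a: float) -> int:
    """Reversed-wedge output sign[x (a - x)(a + x)] for a >= 0.

    Written region by region so that a = math.inf (simple perceptron) and
    a = 0 (fully reversed output) need no special handling. The value at the
    boundaries x = 0, +-a is a convention chosen for this sketch.
    """
    s = 1 if x >= 0 else -1
    return s if abs(x) < a else -s

def parity_machine(x: float, a: float) -> int:
    """Product of the three hidden units sgn(-x), sgn(-a - x), sgn(a - x)."""
    sgn = lambda t: 1 if t >= 0 else -1
    return sgn(-x) * sgn(-a - x) * sgn(a - x)

def perceptron_f(T: int, u: float, a: float) -> float:
    """f = -S_a(u) Theta(-T S_a(u)): non-zero only when student and teacher disagree."""
    S = reversed_wedge(u, a)
    return float(-S) if T != S else 0.0

def hebbian_f(T: int) -> float:
    """f = T_a(v): every example is used, regardless of the student output."""
    return float(T)

def adatron_f(T: int, u: float, l: float, a: float) -> float:
    """f = -u l Theta(-T S_a(u)): correction proportional to the student field."""
    S = reversed_wedge(u, a)
    return -u * l if T != S else 0.0
```

Away from the boundary points, `reversed_wedge` and `parity_machine` return the same sign, which is the equivalence between the non-monotonic perceptron and the $K = 3$ parity machine used throughout the paper.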

We rewrite the update rule, Eq. (3), of $J$ as a set of differential equations by introducing the
dynamical order parameter describing the overlap between the teacher and student weight
vectors, $R^m = (B \cdot J^m)/|J^m|$, and another order parameter describing the norm of the student
weight vector, $l^m = |J^m|/\sqrt{N}$. By taking the overlap of both sides of Eq. (3) with $B$ and by
squaring both sides of the same equation, we obtain the dynamical equations in the limit of
large $m$ and $N$, keeping $\alpha = m/N$ finite, as

    \frac{dl}{d\alpha} = \frac{1}{2l} \langle\langle f^2(T_a(v),u) + 2 f(T_a(v),u)\, u\, l \rangle\rangle    (4)

and

    \frac{dR}{d\alpha} = \frac{1}{l^2} \langle\langle -\frac{R}{2} f^2(T_a(v),u) - (Ru - v) f(T_a(v),u)\, l \rangle\rangle.    (5)

Here $\langle\langle \cdots \rangle\rangle$ denotes the average over the randomness of inputs,

    \langle\langle \cdots \rangle\rangle = \int\!\!\int du\, dv\, (\cdots)\, P_R(u,v)    (6)

with

    P_R(u,v) = \frac{1}{2\pi\sqrt{1-R^2}} \exp\left( -\frac{u^2 + v^2 - 2Ruv}{2(1-R^2)} \right).    (7)

As we are interested in the typical behavior under our training algorithm, we have averaged
both sides of Eqs. (4) and (5) over all possible instances of examples. The Gaussian
distribution (7) has been derived from the central limit theorem.

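The generalization error, which is the probability of disagreement between the teacher
and the trained student, is represented as $\epsilon_g = \langle\langle \Theta(-T_a(v)S_a(u)) \rangle\rangle$. After simple
calculations, we obtain the generalization error as

    E(R) \equiv \epsilon_g = 2\int_0^a Dv \left[ H\left(\frac{Rv}{\sqrt{1-R^2}}\right) + H\left(\frac{a - Rv}{\sqrt{1-R^2}}\right) - H\left(\frac{a + Rv}{\sqrt{1-R^2}}\right) \right]
                           + 2\int_a^\infty Dv \left[ H\left(\frac{-Rv}{\sqrt{1-R^2}}\right) - H\left(\frac{a - Rv}{\sqrt{1-R^2}}\right) + H\left(\frac{a + Rv}{\sqrt{1-R^2}}\right) \right]    (8)

where we have set $H(x) = \int_x^\infty Dt$ with $Dt = dt\, \exp(-t^2/2)/\sqrt{2\pi}$.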

We would like to emphasize that the generalization error obtained above, Eq. (8), is
independent of the specific learning algorithm. In Fig. 1, we plot $E(R) = \epsilon_g$ for several values of
$a$. This figure tells us that, for all values of $a$, the student can acquire a perfect generalization
ability if he is trained so that $R$ converges to 1. We have also confirmed analytically that
$E(R)$ is a monotonically decreasing function of $R$ for any value of $a$.
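As a numerical cross-check of this statement, the disagreement probability can be evaluated directly from its definition. The sketch below is an illustration under stated assumptions: it conditions on the teacher field $v$, uses the Gaussian form of $P_R(u,v)$ to write the probability that the student outputs $+1$, and integrates over $v$ with a simple midpoint rule. The function names, grid size and cutoff are choices made here.

```python
import math

def H(x: float) -> float:
    """Gaussian tail H(x) = int_x^inf Dt, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def generalization_error(R: float, a: float, n: int = 8000, vmax: float = 8.0) -> float:
    """Midpoint-rule estimate of Prob[T_a(v) != S_a(u)] for |R| < 1."""
    s = math.sqrt(1.0 - R * R)
    h = 2.0 * vmax / n
    total = 0.0
    for i in range(n):
        v = -vmax + (i + 0.5) * h
        # Probability that the student outputs +1 given the teacher field v,
        # i.e. that u = R v + s z falls in (0, a) or below -a.
        p_plus = H(-R * v / s) - H((a - R * v) / s) + H((a + R * v) / s)
        teacher = 1 if (0.0 < v < a or v < -a) else -1
        p_err = 1.0 - p_plus if teacher == 1 else p_plus
        total += p_err * math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi) * h
    return total
```

For $a = \infty$ the routine reduces to the simple-perceptron result $\arccos(R)/\pi$, and for fixed finite $a$ the returned value decreases as $R$ grows, in line with the monotonicity noted above.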



III. HEBBIAN AND PERCEPTRON LEARNING ALGORITHMS 

A. Hebbian learning 

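We first investigate the performance of the on-line Hebbian learning $f = T_a(v)$. We get
the differential equations for $l$ and $R$ as follows:

    \frac{dl}{d\alpha} = \frac{1}{l} \left[ \frac{1}{2} + \frac{2R}{\sqrt{2\pi}} (1 - 2\Delta)\, l \right]    (9)

    \frac{dR}{d\alpha} = \frac{1}{l^2} \left[ -\frac{R}{2} + \frac{2}{\sqrt{2\pi}} (1 - 2\Delta)(1 - R^2)\, l \right]    (10)

with $\Delta \equiv \exp(-a^2/2)$.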



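To determine whether or not $R$ increases with $\alpha$ for a given $a$, we approximate the
differential equation for $R$ around $R = 0$ as

    \frac{dR}{d\alpha} \simeq \frac{2}{\sqrt{2\pi}} \frac{(1 - 2\Delta)}{l}.    (11)

Therefore we use $R = 1 - \epsilon$ for $a > a_c = \sqrt{2\log 2}$ and $R = \epsilon - 1$ for $a < a_c$. When $a > a_c$, we
obtain

    \epsilon_g = \frac{1}{\sqrt{2\pi}} \frac{1 + 2\Delta}{1 - 2\Delta} \frac{1}{\sqrt{\alpha}}    (12)

and

    l = \sqrt{\frac{2}{\pi}} (1 - 2\Delta)\, \alpha.    (13)

On the other hand, for $a < a_c$ we obtain

    \epsilon_g = 1 + \frac{1}{\sqrt{2\pi}} \frac{1 + 2\Delta}{1 - 2\Delta} \frac{1}{\sqrt{\alpha}}    (14)

and

    l = -\sqrt{\frac{2}{\pi}} (1 - 2\Delta)\, \alpha.    (15)

We see that the Hebbian learning algorithm leads to the state $R = -1$ for $a < a_c$.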
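The two regimes can be illustrated by integrating the Hebbian flow for $l$ and $R$ numerically. The following is a minimal forward-Euler sketch; the step size, the initial condition $(R, l) = (0, 1)$, the clipping of $R$ to $[-1, 1]$ and the function name are all assumptions made for illustration.

```python
import math

def hebbian_flow(a: float, alpha_max: float = 50.0, d_alpha: float = 1e-3,
                 R0: float = 0.0, l0: float = 1.0):
    """Forward-Euler integration of the Hebbian flow.

    dl/dalpha = [1/2 + c R l] / l
    dR/dalpha = [-R/2 + c (1 - R^2) l] / l^2
    with c = (2 / sqrt(2 pi)) (1 - 2 Delta) and Delta = exp(-a^2 / 2).
    """
    delta = math.exp(-0.5 * a * a)
    c = 2.0 / math.sqrt(2.0 * math.pi) * (1.0 - 2.0 * delta)
    R, l = R0, l0
    for _ in range(int(alpha_max / d_alpha)):
        dl = (0.5 + c * R * l) / l
        dR = (-0.5 * R + c * (1.0 - R * R) * l) / (l * l)
        l += d_alpha * dl
        R = max(-1.0, min(1.0, R + d_alpha * dR))  # keep the overlap in its valid range
    return R, l
```

With these settings, a width above $a_c = \sqrt{2\log 2} \approx 1.18$ (for example $a = 2$) drives $R$ towards $+1$, while a width below it (for example $a = 0.5$) drives $R$ towards $-1$; in both cases $l$ grows roughly linearly in $\alpha$.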



B. Perceptron learning 

We next investigate the on-line perceptron learning $f = -S_a(u)\,\Theta(-T_a(v)S_a(u))$ by
solving the following differential equations numerically:

    \frac{dl}{d\alpha} = \left[ \frac{1}{2} E(R) - F(R)\, l \right] / l    (16)

    \frac{dR}{d\alpha} = \left[ -\frac{1}{2} E(R) R + (F(R) R - G(R))\, l \right] / l^2    (17)

where $F(R) = \langle\langle \Theta(-T_a(v)S_a(u)) S_a(u)\, u \rangle\rangle$ and $G(R) = \langle\langle \Theta(-T_a(v)S_a(u)) S_a(u)\, v \rangle\rangle$. Using
the distribution (7) we can rewrite these functions as

    F(R) = \frac{(1 - R)}{\sqrt{2\pi}} (1 - 2\Delta)    (18)

and

    G(R) = -F(R)    (19)

where $\Delta = \exp(-a^2/2)$. In Fig. 2 we plot the change of $R$ and $l$ as learning proceeds under
various initial conditions for the case of $a = \infty$. We see that the student can reach the
perfect generalization state $R = 1$ for any initial condition. The $R$-$l$ flow in the opposite
limit $a = 0$ is shown in Fig. 3. Apparently, for this case the student reaches the state with
the weight vector opposite to the teacher, $R = -1$, after an infinite number of patterns are
presented. From Eqs. (16) and (17), we should notice that the case of $a = 0$ is essentially
different from the case of a simple perceptron.

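Since the two limiting cases, $a = \infty$ and $a = 0$, follow different types of behavior, it
is necessary to check what happens in the intermediate region. For this purpose, we first
investigate the asymptotic behavior of the solution of Eqs. (16) and (17) near $R = \pm 1$ for
large $\alpha$. Using the notation $R = 1 - \epsilon$, $\epsilon \to 0$, the asymptotic forms of $E(R)$, $F(R)$ and $G(R)$
are found to be

    E(R) \simeq \frac{\sqrt{2\epsilon}}{\pi} (1 + 2\Delta)    (20)

    F(R) \simeq \frac{\epsilon}{\sqrt{2\pi}} (1 - 2\Delta)    (21)

    G(R) \simeq -\frac{\epsilon}{\sqrt{2\pi}} (1 - 2\Delta).    (22)

Substituting these expressions into the differential equations (16) and (17), we obtain

    \epsilon = \left[ \frac{(1 + 2\Delta)}{3\sqrt{2}\,(1 - 2\Delta)^2} \right]^{2/3} \alpha^{-2/3}    (23)

    l = \frac{1}{2\sqrt{\pi}} \left( \frac{1 + 2\Delta}{1 - 2\Delta} \right) \left[ \frac{3\sqrt{2}\,(1 - 2\Delta)^2}{(1 + 2\Delta)} \right]^{1/3} \alpha^{1/3}.    (24)

Therefore, the generalization error is obtained from Eq. (20) as

    \epsilon_g = \frac{\sqrt{2}\,(1 + 2\Delta)}{\pi} \left[ \frac{(1 + 2\Delta)}{3\sqrt{2}\,(1 - 2\Delta)^2} \right]^{1/3} \alpha^{-1/3}.    (25)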



The asymptotic form of $l$, Eq. (24), shows that $\Delta$ should satisfy $2\Delta < 1$, or $a > a_c$. The
assumption of $R = 1 - \epsilon$ with $\epsilon \to 0$ thus fails if $a < a_c$. This fact can be verified from Eq.
(17) expanded around $R = 0$ as

    \frac{dR}{d\alpha} \simeq \frac{1}{\sqrt{2\pi}\, l} (1 - 2\Delta).    (26)

For $a < a_c$, $R$ decreases with $\alpha$. Therefore, we use the relation $R = \epsilon - 1$, $\epsilon \to 0$, instead of
$R = 1 - \epsilon$ for $a < a_c$. We then find the asymptotic form of the generalization error as

    \epsilon_g = 1 + \left( \frac{1 + 2\Delta}{1 - 2\Delta} \right) \frac{1}{\sqrt{2\pi\alpha}}    (27)

and $l$ goes to infinity as

    l = -\sqrt{\frac{2}{\pi}} (1 - 2\Delta)\, \alpha.    (28)

These two results, Eqs. (25) and (27), confirm the difference in the asymptotic behaviors
between the two cases of $a = 0$ and $a = \infty$.

We have found that the Hebbian and the conventional perceptron learning algorithms
lead to the state $R = -1$ for $a < a_c = \sqrt{2\log 2}$. This anti-learning effect may be understood
as follows. Suppose the student perceptron has learned only one example by the Hebb rule,

    J = T_{a=0}(v)\, x.    (29)

Then the output of the student for the same example is

    S_{a=0}(u) = -\mathrm{sgn}(u) = -\mathrm{sgn}(J \cdot x) = -T_{a=0}(v).    (30)

This relation indicates the anti-learning effect for the $a = 0$ case. A similar analysis holds for
the perceptron learning.
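The one-example argument can be replayed numerically. In the sketch below, the teacher vector, the input dimension and the random seed are hypothetical choices; only the Hebb step $J = T x$ and the $a = 0$ transfer function $-\mathrm{sgn}(\cdot)$ are taken from the text.

```python
import math
import random

def sign(t: float) -> int:
    return 1 if t >= 0 else -1

def one_shot_hebb_anti_learning(n: int = 101, seed: int = 0):
    """Learn one example with the Hebb rule at a = 0, then replay the same example."""
    rng = random.Random(seed)
    teacher = [rng.gauss(0.0, 1.0) for _ in range(n)]      # hypothetical teacher vector B
    x = [rng.choice((-1, 1)) for _ in range(n)]            # binary input signal
    T = -sign(sum(b * xi for b, xi in zip(teacher, x)))    # T_{a=0}(v) = -sgn(v)
    J = [T * xi for xi in x]                               # Hebb step from J = 0: J = T x
    S = -sign(sum(j * xi for j, xi in zip(J, x)))          # S_{a=0}(u) = -sgn(J . x)
    return T, S

# Width at which 1 - 2 exp(-a^2 / 2) changes sign, i.e. a_c = sqrt(2 log 2)
a_c = math.sqrt(2.0 * math.log(2.0))
```

Because $J \cdot x = T\,|x|^2$ has the sign of $T$, the returned student output is always the opposite of the teacher output on the example it has just learned.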



C. Generalized perceptron learning 



In this section, we introduce a multiplicative factor $|u|^\gamma$ in front of the perceptron learning
function, $f = -|u|^\gamma\, \Theta(-T_a(v)S_a(u))\, S_a(u)$, and investigate how the generalization ability
depends on the parameter $\gamma$. In particular, we are interested in whether or not an optimal
value of $\gamma$ exists. The learning dynamics is therefore

    J^{m+1} = J^m - |u|^\gamma S_a(u)\, \Theta(-T_a(v)S_a(u))\, x.    (31)

The case of $\gamma = 0$ corresponds to the conventional perceptron learning algorithm. On the
other hand, the case of $\gamma = 1$ and $a \to \infty$ corresponds to the conventional AdaTron learning.
Using the above learning dynamics, we obtain the differential equations with respect to $l$
and $R$ as

    \frac{dl}{d\alpha} = \frac{1}{l} \left[ \frac{1}{2} E_G(R) - l\, F_G(R) \right]    (32)

    \frac{dR}{d\alpha} = \frac{1}{l^2} \left[ -\frac{R}{2} E_G(R) + (F_G(R) R - G_G(R))\, l \right]    (33)

where $E_G(R)$, $F_G(R)$ and $G_G(R)$ are represented as

    E_G(R) = \langle\langle |u|^{2\gamma}\, \Theta(-T_a(v)S_a(u)) \rangle\rangle,    (34)

    F_G(R) = \langle\langle |u|^{\gamma} u\, \Theta(-T_a(v)S_a(u))\, S_a(u) \rangle\rangle    (35)

and

    G_G(R) = \langle\langle |u|^{\gamma}\, \Theta(-T_a(v)S_a(u))\, S_a(u)\, v \rangle\rangle.    (36)

Let us first investigate the behavior of the $R$-$l$ flow near $R = 0$. When $R$ is very small,
the right-hand side of Eq. (33) is found to be a $\gamma$-dependent constant:

    \frac{dR}{d\alpha} \simeq \frac{2^{(\gamma - 1)/2}}{\pi l}\, \Gamma\left( \frac{\gamma}{2} + \frac{1}{2} \right) (1 - 2\Delta),    (37)

where $\Gamma(x)$ is the gamma function. As the right-hand side of Eq. (37) is positive for any $\gamma$
as long as $a$ satisfies $a > a_c$, $R$ increases around $R = 0$ only for this range of $a$. Thus the
generalized perceptron learning algorithm succeeds in reaching the desired state $R = 1$, not
the opposite one $R = -1$, only for $a > a_c$, similarly to the conventional perceptron learning.
Therefore, in this section we restrict our analysis to the case of $a > a_c$ and investigate how
the learning curve changes according to the value of $\gamma$.

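Using the notation $R = 1 - \epsilon$ ($\epsilon \to 0$), we obtain the asymptotic forms of $E_G$, $F_G$ and $G_G$
as follows:

    E_G \simeq c_1 \epsilon^{\gamma + 1/2} + c_2 \epsilon^{1/2}    (38)

    F_G \simeq c_3 \epsilon^{\gamma/2 + 1} - c_4 \epsilon    (39)

    G_G \simeq -\frac{c_3}{\gamma + 1} \epsilon^{\gamma/2 + 1} + c_4 \epsilon    (40)

where $c_1 = 2^{2\gamma + 1/2}\, \Gamma(\gamma + 1)/\pi(2\gamma + 1)$, $c_2 = 2\sqrt{2}\, a^{2\gamma} \Delta/\pi$, $c_3 = 2^{\gamma + 3/2}\, \Gamma(\frac{\gamma}{2} + \frac{3}{2})/\pi(\gamma + 2)$ and
$c_4 = 2\Delta a^{\gamma}/\sqrt{2\pi}$. We first investigate the case of $\Delta \neq 0$ (finite $a$), namely, $c_2, c_4 \neq 0$. The
differential equations (32) and (33) are rewritten in terms of $\epsilon$ and $\delta = 1/l$ as

    \frac{d\delta}{d\alpha} = -\frac{\delta^3}{2} \left( c_1 \epsilon^{\gamma + 1/2} + c_2 \epsilon^{1/2} \right) + \delta^2 \left( c_3 \epsilon^{\gamma/2 + 1} - c_4 \epsilon \right)    (41)

    \frac{d\epsilon}{d\alpha} = \frac{\delta^2}{2} \left( c_1 \epsilon^{\gamma + 1/2} + c_2 \epsilon^{1/2} \right) - c_3 \left( \frac{\gamma + 2}{\gamma + 1} \right) \delta\, \epsilon^{\gamma/2 + 1} + 2 c_4 \delta \epsilon.    (42)

As $\gamma = 0$ corresponds to the perceptron learning, we now assume $\gamma \neq 0$. When $\gamma > 0$, the
terms containing $c_1$ and $c_3$ can be neglected in the leading order. Dividing Eq. (41) by Eq.
(42), we obtain

    \frac{d\delta}{d\epsilon} = \frac{\delta \left[ -c_2 \epsilon^{1/2} \delta/2 - c_4 \epsilon \right]}{\left[ c_2 \epsilon^{1/2} \delta/2 + 2 c_4 \epsilon \right]}.    (43)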



If we assume $\delta\epsilon^{1/2} \gg \epsilon$ or $\delta\epsilon^{1/2} \ll \epsilon$, Eq. (43) is solved as $\delta = \exp(-\epsilon)$, which is in contradiction
to the assumption $|\delta| \ll 1$. Thus, we set



C2 



(44) 



and determine $b$ and $c$ ($> 1/2$). Substituting (44) into (43), we find $b = 8c_4/c_2$ ($c_2, c_4 > 0$) and
$c = 3/2$. The negative value of $\delta = 1/l$ is not acceptable and we conclude that $R$ does not
approach 1 when $\gamma > 0$.

Next we investigate the case of $\gamma < 0$. Using the same technique as in the case of $\gamma > 0$,
we obtain

    \epsilon = \left[ \frac{c_1 (1 + \gamma)(1 - \gamma^2)}{6 c_3^2 (\gamma + 2)} \right]^{2/3} \alpha^{-2/3}    (45)

    \delta = \frac{2 c_3}{c_1} \left( \frac{\gamma + 2}{\gamma + 1} \right) \epsilon^{(1 - \gamma)/2}    (46)

and

    \epsilon_g = \frac{\sqrt{2}}{\pi} (1 + 2\Delta) \left[ \frac{c_1 (1 + \gamma)(1 - \gamma^2)}{6 c_3^2 (\gamma + 2)} \right]^{1/3} \alpha^{-1/3} \equiv \frac{\sqrt{2}}{\pi} (1 + 2\Delta)\, f(\gamma)\, \alpha^{-1/3}.    (47)

We notice that $\gamma$ should satisfy $-1 < \gamma < 0$, because the prefactor of the leading term of
$\delta$, namely, $(2c_3/c_1)(\gamma + 2)/(\gamma + 1)$, must be positive. As the prefactor of the generalization
error increases monotonically from $\gamma = -1$ to $\gamma = 0$, we obtain a smaller generalization
error for $\gamma$ closer to $-1$.

Next we investigate the case of $a \to \infty$, namely $c_2, c_4 = 0$. We first assume $l \to l_0$ in the
limit of $\alpha \to \infty$. In this solution, $dl/d\alpha = 0$ should be satisfied asymptotically. Then, from
Eq. (41), the two terms $\epsilon^{\gamma + 1/2}$ and $\epsilon^{\gamma/2 + 1}$ should be equal to each other, namely, $\epsilon^{\gamma + 1/2} = \epsilon^{\gamma/2 + 1}$,
which leads to $\gamma = 1$. The learning dynamics (31) with $a \to \infty$ and $\gamma = 1$ is nothing but
the AdaTron learning, which has already been investigated in detail [18]. The result for the
generalization error is

    \epsilon_g = \frac{3}{2\alpha}    (48)

if we choose $l_0$ as $l_0 = 1/2$, and

    \epsilon_g = \frac{4}{3\alpha}    (49)

if we optimize $l_0$ to minimize the generalization error.

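We next assume $l \to \infty$ as $\alpha \to \infty$. It is straightforward to see that $\epsilon$ has the same
asymptotic form as in the case of $\Delta \neq 0$ and $\gamma < 0$. Thus we have

    \epsilon_g = \frac{\sqrt{2}}{\pi}\, f_2(\gamma)\, \alpha^{-1/3}    (50)

where $f_2(\gamma)$ is defined as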



/2(7) 



7r(l + 7)(l-7^)r(7 + l) 
6-25/212(2 + i) 



(51) 



and $\gamma$ can take any value within $-1 < \gamma < 0$.

From the above analysis, we conclude that the student can achieve a generalization error
of order $\alpha^{-1}$ if and only if $a \to \infty$ and $\gamma = 1$ (AdaTron). For other cases the generalization error
behaves as $\alpha^{-1/3}$, the same functional form as in the case of the conventional perceptron
learning, as long as the student can obtain a vanishing residual error. Therefore the learning
curve has universality in the sense that it does not depend on the detailed value of the
parameter $\gamma$.

IV. ADATRON LEARNING ALGORITHM

A. AdaTron learning

In this subsection, we investigate the generalization performance of the conventional
AdaTron learning $f = -u\,l\,\Theta(-T_a(v)S_a(u))$ [18]. The differential equations for $l$ and $R$ are
given as follows:

    \frac{dl}{d\alpha} = -\frac{l}{2} E_{Ad}(R)    (52)

    \frac{dR}{d\alpha} = \frac{R}{2} E_{Ad}(R) - G_{Ad}(R)    (53)

where $E_{Ad}(R) = \langle\langle u^2\, \Theta(-T_a(v)S_a(u)) \rangle\rangle$ and $G_{Ad}(R) = \langle\langle uv\, \Theta(-T_a(v)S_a(u)) \rangle\rangle$. After
simple calculations, we obtain



pa p—a\ 

+ 2( / + / ]Duu' 

10 J-oo/ 



H 



and 



H 



a + Ru 



(54) 



GAd{R) = EAd{R)R 

ARaA ,^ 2x 



+ 



+ 



'2tt 
2(1 



(1 - R') 



H 



TT 

Aexp 



2(1 - i?2) 



a{l + R) 



Aexp 



H 



aR 



'a(l-^\ r 



a^jl + R)^ 
' 2(1 -i?2) 



Aexp 



a2(l-i;)2 
' 2(1 -i?2) 



+ exp 



a2 


l' 


2(1 - i?2) 


~ 2 



(55) 



At first, we check the behavior of $R$ around $R = 0$. Evaluating the differential equation
(53) around $R = 0$, we obtain

    \frac{dR}{d\alpha} \simeq \frac{4}{\pi} \left( \frac{1}{2} - \Delta \right)^2.    (56)

From this result we find that for any value of $a$, $R$ increases around $R = 0$. In
Fig. 4, we display the flows in the $R$-$l$ plane for several values of $a$ by numerical integration
of Eqs. (52) and (53). This figure indicates that the overlap $R$ increases monotonically, but $R$ does
not reach the state $R = 1$ if $a$ is finite. This means that the differential equation (53) with
respect to $R$ has a non-trivial fixed point $R = R_0\ (< 1)$ if $a < \infty$, which is the solution of
the non-linear equation $R\, E_{Ad}(R) = 2 G_{Ad}(R)$. Therefore, we conclude that for $a = \infty$ and
$a = 0$, we obtain the generalization error as

    \epsilon_g \simeq \frac{3}{2\alpha},

but the generalization error converges
to a finite value exponentially for finite $a$. In Fig. 5, we plot the corresponding generalization
error.
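For the simple-perceptron limit $a \to \infty$, the decay of the AdaTron learning curve can be checked with a short numerical integration of $dR/d\alpha = (R/2) E_{Ad} - G_{Ad}$. The sketch below rests on assumptions that are not taken from the paper: the closed forms used for $E_{Ad}$ and $G_{Ad}$ are the standard Gaussian averages for $a = \infty$ with $\Theta(-uv)$, and the step size, initial overlap $R = 0$ and function name are illustrative choices.

```python
import math

def adatron_curve_simple_perceptron(alpha_max: float = 200.0, d_alpha: float = 2e-3) -> float:
    """Euler integration of dR/dalpha = (R/2) E_Ad - G_Ad in the a -> infinity limit.

    Assumed closed forms (a = infinity only):
      E_Ad = <<u^2 Theta(-uv)>> = [arccos R - R sqrt(1 - R^2)] / pi
      G_Ad = <<u v Theta(-uv)>> = [R arccos R - sqrt(1 - R^2)] / pi
    Returns the generalization error arccos(R) / pi at alpha_max.
    """
    R = 0.0
    for _ in range(int(alpha_max / d_alpha)):
        theta = math.acos(R)
        root = math.sqrt(max(0.0, 1.0 - R * R))
        E_ad = (theta - R * root) / math.pi
        G_ad = (R * theta - root) / math.pi
        R = min(1.0, R + d_alpha * (0.5 * R * E_ad - G_ad))
    return math.acos(R) / math.pi
```

Multiplying the returned error by `alpha_max` gives a value close to 1.5 for large `alpha_max`, which is the $1/\alpha$ decay with prefactor $3/2$ quoted for the $a = \infty$ AdaTron case.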
B. Modified AdaTron learning

In the previous subsection, we found that the on-line AdaTron learning fails to obtain
zero residual error for finite $a$. In this subsection, we modify the AdaTron learning as
$f = \Theta(-T_a(v)S_a(u))\, h(u)\, l$ with

    h(u) = \begin{cases} a - u & (u > a/2) \\ -u & (-a/2 < u < a/2) \\ -a - u & (u < -a/2) \end{cases}    (57)

and see if the generalization ability of our non-monotonic system is improved. The
motivation for the above choice comes from the optimization of the learning algorithm to be
mentioned in the next section. Details of the derivation of Eq. (57) are found in Appendix A.
Then the differential equation with respect to $R$ is obtained as follows:

    \frac{dR}{d\alpha} = -\frac{R}{2} E_{MA}(R) - R\, F_{MA}(R) + G_{MA}(R)    (58)

where $E_{MA}(R) = \langle\langle h^2(u)\, \Theta(-T_a(v)S_a(u)) \rangle\rangle$, $F_{MA}(R) = \langle\langle u\, h(u)\, \Theta(-T_a(v)S_a(u)) \rangle\rangle$ and
$G_{MA}(R) = \langle\langle v\, h(u)\, \Theta(-T_a(v)S_a(u)) \rangle\rangle$. To see the asymptotic behavior of the generalization
error, we evaluate the leading-order contribution as $R$ approaches 1, $R = 1 - \epsilon$, as



2 /2 

^MA — —{l + 2A)e'^ (59) 



2^/2 



'MA' 



TT 



(l + 2(l-a')A)£t (60) 



3 



Gma~ (61) 

71 



Substituting these expressions into the differential equation (58), we obtain
$\epsilon^{1/2} = 3\pi / (2\sqrt{2}\,(1 + 2\Delta)\alpha)$ and the generalization error as

    \epsilon_g = \frac{\sqrt{2}\,(1 + 2\Delta)}{\pi}\, \epsilon^{1/2} = \frac{3}{2\alpha}.    (62)

We should notice that the above result is independent of $a$ and the generalization ability of
the student is improved by this modification for all finite $a$.
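The piecewise weight $h(u)$ is simple to write down explicitly. The following sketch is illustrative only: the function names are hypothetical, the student output $S$ is passed in rather than recomputed, and the assignment of the boundary points $u = \pm a/2$ to the middle branch is a convention chosen here because the definition leaves them open.

```python
import math

def h(u: float, a: float) -> float:
    """Piecewise weight of the modified AdaTron rule.

    a - u   for u >  a/2
    -u      for -a/2 <= u <= a/2   (boundary points assigned here by convention)
    -a - u  for u < -a/2
    """
    if u > 0.5 * a:
        return a - u
    if u < -0.5 * a:
        return -a - u
    return -u

def modified_adatron_f(T: int, S: int, u: float, l: float, a: float) -> float:
    """f = Theta(-T S) h(u) l: update only when teacher and student outputs differ."""
    return h(u, a) * l if T != S else 0.0
```

The weight vanishes at $u = 0$ and at $u = \pm a$, so each misclassified example is corrected in proportion to its distance from the nearest decision boundary of the student; for $a = \infty$ the middle branch covers all $u$ and the conventional AdaTron weight $-u\,l$ is recovered.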



V. OPTIMIZED LEARNING 

A. Optimization of the learning rate 

In the present subsection, we improve the conventional perceptron learning by
introducing a time-dependent learning rate [20], [19]. We consider the following on-line dynamics:

    J^{m+1} = J^m - g(\alpha)\, \Theta(-T_a(v)S_a(u))\, S_a(u)\, x.    (63)

Using the same technique as in the previous section, we can derive the differential
equations with respect to $l$ and $R$ as follows:

    \frac{dl}{d\alpha} = \frac{1}{l} \left[ \frac{1}{2} g(\alpha)^2 E(R) - g(\alpha) F(R)\, l \right]    (64)

    \frac{dR}{d\alpha} = \frac{1}{l^2} \left[ -\frac{R}{2} g(\alpha)^2 E(R) + g(\alpha) (F(R) R - G(R))\, l \right] \equiv \frac{1}{l^2} L(g(\alpha)).    (65)

The optimal learning rate $g_{opt}(\alpha)$ is determined so as to maximize $L(g(\alpha))$ to accelerate the
increase of $R$. We then find

    g_{opt} = \frac{[F(R) R - G(R)]\, l}{R\, E(R)}.    (66)

Substituting this expression into the above differential equations, we obtain

    \frac{dl}{d\alpha} = -\frac{[F(R) R - G(R)]\,[F(R) R + G(R)]\, l}{2 R^2 E(R)}    (67)

    \frac{dR}{d\alpha} = \frac{[F(R) R - G(R)]^2}{2 R E(R)}.    (68)


We can obtain the asymptotic form of $\epsilon$ ($= 1 - R$), $l$ and $\epsilon_g$ with the same technique of
analysis as in the previous section:

    \epsilon = \left[ \frac{2\sqrt{2}\,(1 + 2\Delta)}{(1 - 2\Delta)^2} \right]^2 \alpha^{-2}    (69)

    l = \exp\left[ -16 \left( \frac{1 + 2\Delta}{(1 - 2\Delta)^2} \right)^4 \alpha^{-4} \right]    (70)

and

    \epsilon_g = \frac{\sqrt{2}\,(1 + 2\Delta)}{\pi} \left[ \frac{2\sqrt{2}\,(1 + 2\Delta)}{(1 - 2\Delta)^2} \right] \alpha^{-1}.    (71)

Therefore, the generalization ability has been improved from $\alpha^{-1/3}$ for $g = 1$ to $\alpha^{-1}$. The
optimal learning rate $g_{opt}(\alpha)$ behaves asymptotically as

    g_{opt} = \frac{2\sqrt{2\pi}}{(1 - 2\Delta)}\, \alpha^{-1} \exp\left[ -16 \left( \frac{1 + 2\Delta}{(1 - 2\Delta)^2} \right)^4 \alpha^{-4} \right].    (72)

The factor $F(R)R - G(R)$ of $g_{opt}$ appearing in Eq. (66) is calculated by substituting $F(R)$
and $G(R)$ in Eqs. (18) and (19) as $F(R)R - G(R) = (1 - R^2)(1 - 2\Delta)/\sqrt{2\pi}$. Thus, at
$a = a_c = \sqrt{2\log 2}$, the optimal learning rate vanishes. Therefore our formulation does not
work at $a = a_c$.

As the optimal learning rate $g_{opt}$ changes sign at $a = a_c$, from the arguments in
Section III, we can see why the optimal learning rate can eliminate the anti-learning.
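The $\alpha^{-1}$ law under the optimal learning rate can be checked numerically in the simplest case. The sketch below integrates $dR/d\alpha = [F R - G]^2 / (2 R E)$ for $a \to \infty$ ($\Delta = 0$), where $F(R)R - G(R) = (1 - R^2)/\sqrt{2\pi}$ and, as an assumption for this limit, $E(R)$ is the simple-perceptron error $\arccos(R)/\pi$. The step size, the small positive initial overlap and the cap keeping $R$ strictly below 1 are choices made here.

```python
import math

def optimal_rate_curve(alpha_max: float = 200.0, d_alpha: float = 2e-3,
                       R0: float = 0.01) -> float:
    """Euler integration of dR/dalpha = [F R - G]^2 / (2 R E) for a -> infinity.

    Returns the generalization error arccos(R) / pi at alpha_max.
    """
    R = R0
    for _ in range(int(alpha_max / d_alpha)):
        E = math.acos(R) / math.pi                            # simple-perceptron error
        FR_minus_G = (1.0 - R * R) / math.sqrt(2.0 * math.pi)
        dR = FR_minus_G ** 2 / (2.0 * R * E)
        R = min(1.0 - 1e-12, R + d_alpha * dR)                # keep E strictly positive
    return math.acos(R) / math.pi
```

For large `alpha_max` the product of `alpha_max` and the returned error approaches $4/\pi \approx 1.27$, which is the $\Delta = 0$ value of the prefactor $4(1 + 2\Delta)^2/\pi(1 - 2\Delta)^2$ implied by the asymptotic form of $\epsilon_g$.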
In relation to this phenomenon at $a = \sqrt{2\log 2}$, Van den Broeck [21], [22] recently
investigated the same reversed-wedge perceptron which learns in the unsupervised mode from the
distribution

    P(v) = 2\, \frac{\exp(-v^2/2)}{\sqrt{2\pi}}\, [\Theta(v - a) + \Theta(v + a)\,\Theta(-v)]    (73)

with $v = \sqrt{N}(B \cdot x)/|B|$. For small $\alpha$, he found $R(\alpha) \sim \sqrt{\alpha \langle v \rangle^2}$ for the optimal on-line
learning, where $\langle \cdots \rangle$ denotes the average over the distribution (73). Then he showed
that at $a = \sqrt{2\log 2}$, the distribution (73) leads to $\langle v \rangle = 0$ and consequently $R(\alpha) = 0$.
From this result, he concluded that as long as $\langle v \rangle = 0$ holds, any kind of on-line learning
necessarily fails and the corresponding learning curve has a plateau. It seems that a similar
mechanism may lead to a failure of the optimal learning at $a = \sqrt{2\log 2}$ in our model.



B. Optimization of the weight function using the Bayes formula 



In this subsection we try another optimization procedure by Kinouchi and Caticha [23].
We choose the optimal weight function $f(T_a(v),u)$ by differentiating the right-hand side of
Eq. (5) with the aim to accelerate the increase of $R$:

    f^* = \frac{l}{R} (v - Ru).    (74)

It is important to remember that $f^*$ contains some information unknown to the student,
namely, the local field of the teacher $v$. Therefore, we should average $f^*$ over a suitable
distribution to erase $v$ from $f^*$. For this purpose, we transform the variables $u$ and $v$ to $u$
and $z$:

    v = z\sqrt{1 - R^2} + Ru.    (75)

Then, the connected Gaussian distribution $P_R(u,v)$ is rewritten as

    P_R(u,z) = \frac{1}{2\pi} \exp\left(-\frac{u^2}{2}\right) \exp\left(-\frac{z^2}{2}\right).    (76)

We then obtain

    \langle f^* \rangle = \frac{\sqrt{1 - R^2}}{R}\, l\, \langle z \rangle    (77)

where $\langle \cdots \rangle$ stands for the averaging over the variable $v$. Substituting this into the
differential equation (5), we find

    \frac{dR}{d\alpha} = \frac{(1 - R^2)}{2R} \langle\langle \langle z \rangle^2 \rangle\rangle.    (78)

Let us now calculate $\langle z \rangle$. For this purpose, we use the distribution $P(z|y,u)$. This
quantity means the posterior probability of $z$ when $y$ and $u$ are given, where we have set
$y = T_a(v)$. This conditional probability is rewritten by the Bayes formula

    P(z|y,u) = \frac{P(z)\, P(y|u,z)}{\int dz\, P(z)\, P(y|u,z)}    (79)

from which we can calculate $\langle z \rangle$ as

    \langle z \rangle = \int dz\, z\, P(z|y,u) = \frac{\int dz\, z\, P(z)\, P(y|u,z)}{\int dz\, P(z)\, P(y|u,z)} = \frac{\int Dz\, z\, P(y|u,z)}{\int Dz\, P(y|u,z)}.    (80)

Here $P(y|u,z)$ is given as

    P(y|u,z) = y\,\Theta(z\sqrt{1 - R^2} + Ru) - y\,\Theta(z\sqrt{1 - R^2} + Ru - a) + y\,\Theta(-z\sqrt{1 - R^2} - Ru - a) + \frac{1}{2}(1 - y)    (81)

from the relation $y = T_a(v)$. Then, the denominator of Eq. (80) is calculated as

    \int Dz\, P(y|u,z) = y \int Dz\, \Theta(z\sqrt{1 - R^2} + Ru) - y \int Dz\, \Theta(z\sqrt{1 - R^2} + Ru - a)
                         + y \int Dz\, \Theta(-z\sqrt{1 - R^2} - Ru - a) + \frac{1}{2}(1 - y) = \Omega(y|u),    (82)

where $\Omega(y|u)$ means the posterior probability of $y$ when the local field of the student $u$ is
given. As we treat the binary output teacher, we obtain from Eq. (82)

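    \Omega(y|u) = y\, H\left(\frac{-Ru}{\sqrt{1 - R^2}}\right) - y\, H\left(\frac{a - Ru}{\sqrt{1 - R^2}}\right) + y\, H\left(\frac{a + Ru}{\sqrt{1 - R^2}}\right) + \frac{1}{2}(1 - y).    (83)

In Figs. 6 ($R = 0.5$) and 7 ($R = 0.9$), we plot $\Omega(+1|u)$ for the cases of $a = 4.0$, 2.0, 1.0 and
$a = 0.5$. From these figures, we find that for any $a$, $\Omega(+1|u)$ seems to approach $(T_a(u) + 1)/2$ as
$R$ goes to $+1$. Using the same technique, we can calculate $\int Dz\, z\, P(y|u,z)$ and obtain

    \int Dz\, z\, P(y|u,z) = \frac{\sqrt{1 - R^2}}{R} \frac{\partial}{\partial u} \Omega(y|u).    (84)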
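The posterior probability of the teacher output given the student field is easy to evaluate numerically. The sketch below assumes the form obtained by integrating $P(y|u,z)$ over the Gaussian variable $z$, namely $H(-A_1) - H(A_2) + H(A_3)$ for $y = +1$ with $A_1 = Ru/\sqrt{1-R^2}$, $A_2 = (a - Ru)/\sqrt{1-R^2}$ and $A_3 = (a + Ru)/\sqrt{1-R^2}$; the function names are illustrative.

```python
import math

def H(x: float) -> float:
    """Gaussian tail H(x) = int_x^inf Dt."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def omega(y: int, u: float, R: float, a: float) -> float:
    """Posterior probability of the teacher output y = +-1 given the student field u (|R| < 1)."""
    s = math.sqrt(1.0 - R * R)
    plus = H(-R * u / s) - H((a - R * u) / s) + H((a + R * u) / s)
    return plus if y == 1 else 1.0 - plus
```

The two probabilities sum to one for every $u$, and as $R$ approaches 1 the value for $y = +1$ approaches 1 inside the wedge $0 < u < a$ and 0 for $u > a$, which is the limiting behavior $(T_a(u) + 1)/2$ described above.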
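Substituting this into the right-hand side of $dR/d\alpha$, Eq. (78), we obtain

    \frac{dR}{d\alpha} = \frac{(1 - R^2)^2}{2R^3} \left\langle\left\langle \left( \frac{\partial}{\partial u} \log \Omega(y|u) \right)^2 \right\rangle\right\rangle    (85)

where $\langle\langle \cdots \rangle\rangle$ stands for the averaging over the distribution
$P(y,u) = \int dz\, P(y|u,z)\, P(u)\, P(z)$. Performing this average,
we finally obtain

    \frac{dR}{d\alpha} = \frac{(1 - R^2)}{4\pi R} \int Du\, \Xi_a(R,u)    (86)

where

    \Xi_a(R,u) = \left[ \exp\left(-\frac{A_1^2}{2}\right) - \exp\left(-\frac{A_2^2}{2}\right) - \exp\left(-\frac{A_3^2}{2}\right) \right]^2
                 \times \left[ \frac{1}{H(-A_1) - H(A_2) + H(A_3)} + \frac{1}{H(A_1) + H(A_2) - H(A_3)} \right]    (87)

and $A_1 = Ru/\sqrt{1 - R^2}$, $A_2 = (a - Ru)/\sqrt{1 - R^2}$, $A_3 = (a + Ru)/\sqrt{1 - R^2}$. We plot the
generalization error by numerically solving Eqs. (16), (17), (67), (68), and (86) for the cases of
$a = \infty$ in Fig. 8 and $a = 1.0$ in Fig. 9. From these figures, we see that for both cases
of $a = \infty$ and $a < \infty$, the generalization error calculated by the Bayes formula converges
more quickly to zero than that obtained with the optimal learning rate $g_{opt}(\alpha)$.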

Recently, Simmonetti and Caticha introduced an on-line learning algorithm for the
non-overlapping parity machine with a general number of nodes $K$. In their method, the
weight vector of the student in each hidden unit is trained by the method in Ref. [23]. In
order to average over the internal fields of the teacher in the differential equation with respect
to the specific hidden unit $k$ of the student, they need the conditional probability which
depends not only on the internal field of the unit $k$ but also on the internal fields of the other
units ($i \neq k$). This fact shows that their optimal algorithm is non-local. In our problem, the
input-output relation of the machine can be mapped to that of a single-layer reversed-wedge
perceptron. Therefore, it is not necessary for us to use the information about all units and
our optimizing procedure leads to a local algorithm.

In order to investigate the performance of the Bayes optimization, we have calculated
the asymptotic form of the generalization error from Eq. (86) and the result is

    \frac{1}{\epsilon^{1/2}} = \frac{(1 + 2\Delta)\, C\, \alpha}{2\pi^{3/2}}    (88)

for $\epsilon = 1 - R$, where

    C = \int_{-\infty}^{\infty} dt\, \frac{\exp(-t^2)}{H(t)}.    (89)

The generalization error is then given by Eq. (20) as

    \epsilon_g = \frac{2\sqrt{2\pi}}{\int_{-\infty}^{\infty} dt\, \exp(-t^2)/H(t)}\, \frac{1}{\alpha} \simeq 0.883\, \frac{1}{\alpha}.    (90)

This asymptotic form of the generalization error agrees with the result of Kinouchi and
Caticha [23].
We notice that this form is independent of the width of the reversed wedge $a$.
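The numerical prefactor can be reproduced with a one-dimensional quadrature. The sketch below assumes that the prefactor is $2\sqrt{2\pi}/C$ with $C = \int_{-\infty}^{\infty} dt\, \exp(-t^2)/H(t)$, and evaluates the integral with a midpoint rule on a truncated interval; the cutoff, the grid size and the function names are choices made here.

```python
import math

def H(t: float) -> float:
    """Gaussian tail H(t) = int_t^inf Dx."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def bayes_prefactor(t_max: float = 12.0, n: int = 24000) -> float:
    """Return 2 sqrt(2 pi) / C with C = int dt exp(-t^2) / H(t), by the midpoint rule."""
    h = 2.0 * t_max / n
    C = 0.0
    for i in range(n):
        t = -t_max + (i + 0.5) * h
        C += math.exp(-t * t) / H(t) * h
    return 2.0 * math.sqrt(2.0 * math.pi) / C
```

Calling `bayes_prefactor()` gives a value close to the coefficient 0.883 of the $1/\alpha$ law; the integrand decays like a Gaussian on both sides, so the truncation at $|t| = 12$ has a negligible effect.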
We next mention the physical meaning of $\Xi_a(R,u)$ appearing in the differential equation
(86). As the rate of increase $dR/d\alpha$ is proportional to $\Xi_a(R,u)$, this quantity is regarded as
the distribution of the gain which determines the increase of $R$. Therefore, $\Xi_a(R,u)$ yields
important information about the strategy to make queries. A query means to restrict the
input signal to the student, $u$, to some subspace. Kinzel and Rujan suggested that if the
student learns by the Hebbian learning algorithm from restricted inputs, namely, inputs
lying on the subspace $u = 0$, the prefactor of the generalization error becomes a half [25]. In
the present formulation (86), query-making can be incorporated by inserting appropriate
delta functions in the integrand. The learning process is clearly accelerated by choosing the
peak position of $\Xi_a(R,u)$ as the location of these delta functions. In Fig. 10 we plot the
distribution $\Xi_a(R,u)$ for $a = 2.0$ (top) and $a = 0.8$ (bottom). From these figures, we learn
that for large $a$ ($= 2.0$), the most effective example lies on the decision boundary ($u = 0$)
at the initial training stage (small $R$). However, as the student learns, two different peaks
appear symmetrically and in the final stage of training, the distribution has three peaks
around $u = 0$ and $u = \pm a$. On the other hand, for small $a$ ($= 0.8$), the most effective
examples lie at the tails ($u = \pm\infty$) in the initial stage. In the final stage, the distribution
has two peaks around $u = \pm a$. Therefore it is desirable to change the location of queries
adaptively.



VI. CONCLUSION 

We have investigated the generalization abilities of a non-monotonic perceptron, which 
may also be regarded as a multilayer neural network, a parity machine, in the on-line mode. 
We first showed that the conventional perceptron and Hebbian learning algorithms lead 
to the perfect learning R = 1 only when a > ttc = A/2log2. The same algorithms yield 
the opposite state i? = — 1 in the other case a < ac- These algorithms have originally 
been designed having the simple perceptron (a = oo) in mind, and thus are natural to give 
the opposite result for the reversed-output system (a~0). In contrast, the conventional 
AdaTron learning algorithm failed to obtain the zero residual error for all finite values of a. 
For the unlearnable situation (where the structures of the teacher and student are different), 
Inoue and Nishimori reported that the AdaTron learning converges to the largest residual 



error among the three algorithms |T9[ . It is interesting that the AdaTron learning algorithm 
is not useful even for the learnable situation. 

In order to overcome this difficulty, we introduced several modified versions of the conven- 
tional learning rules. We first introduced the time- dependent learning rate into the on-line 
perceptron learning and optimize it. As a result, the generalization error converges to zero 
in proportion to except at a = v^21og2 where the learning rate becomes identically zero. 
We next improved the conventional AdaTron learning by modifying the weight function 
so that it changes according to the value of the internal potential u of the student. By 
this modification, the generalization ability of the student dramatically improved and the 
generalization error converges to zero with an a-independent form, 2a~^. 

We also investigated a different type of optimization: we optimized the weight 
function $f(T_a(v), u)$ appearing in the on-line dynamics, rather than the rate $g$. Then, since the function 
$f$ contains the variable $v$, which is unknown to the student, we averaged it over the distribution of $v$ 
using the well-known technique of Bayes statistics. This optimization procedure also provided other 
useful information for the student, namely, the distribution of the most effective examples. 



Kinzel and Rujan [23] reported that for the situation in which a simple perceptron learns 
from a simple perceptron (the $a = \infty$ case), Hebbian learning with selected examples 
($u = 0$) leads to faster convergence of the generalization error than conventional Hebbian 
learning. However, we have found that for finite values of $a$, the most effective examples lie 
not only on the boundary $u = 0$ but also on $u = \pm a$. Furthermore, we found that for 
small values of $a$ and at the initial stage of learning ($R$ small), the most effective examples 
lie on the tails ($u = \pm\infty$). As the learning proceeds, the most effective examples move 
to $u = \pm a$. This information is useful for constructing effective queries adaptively 
at each stage of learning. 

APPENDIX A: DERIVATION OF THE WEIGHT FUNCTION IN THE 
MODIFIED ADATRON LEARNING ALGORITHM 



In this appendix, we explain how we introduced the modified weight function 
$\Theta(-T_a(v) S_a(u))\, h(u)\, l$ appearing in the AdaTron learning algorithm in Sec. IV B. From 
Eqs. (77) and (|8^) in Sec. V, the weight function using the Bayes formula is written as 



(A1) 



As this expression contains the parameter $R$, which is unknown to the student, we look for a 
suitable learning weight function that agrees with the asymptotic form of $\langle f^* \rangle$ in the 
limit of $R \to 1$ [|1^]. For this purpose, we investigate the asymptotic form of $\Omega(y|u)$ as follows. 
We consider the cases of $T_a(v) = y = 1$ and $y = -1$ separately. 



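(I) $y = 1$ 

Using the relation $R = 1 - \varepsilon$, $\varepsilon \to 0$, we find 

$$
\begin{aligned}
\Omega(y|u) &= H\!\left(-\frac{Ru}{\sqrt{1-R^2}}\right) - H\!\left(\frac{a - Ru}{\sqrt{1-R^2}}\right) + H\!\left(\frac{a + Ru}{\sqrt{1-R^2}}\right) \\
&\simeq \frac{1}{2}\left[ \mathrm{erfc}\!\left(-\frac{u}{2\sqrt{\varepsilon}}\right) - \mathrm{erfc}\!\left(\frac{a - u}{2\sqrt{\varepsilon}}\right) + \mathrm{erfc}\!\left(\frac{a + u}{2\sqrt{\varepsilon}}\right) \right].
\end{aligned}
\tag{A2}
$$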



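The asymptotic form of $\Omega(y|u)$ depends on the range of $u$. For $u > a$, the asymptotic form 
of $\Omega(y|u)$ is 

$$
\Omega(y|u) \simeq \frac{1}{u - a} \sqrt{\frac{\varepsilon}{\pi}}\, \exp\!\left[-\frac{(u - a)^2}{4\varepsilon}\right]. \tag{A3}
$$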



Therefore, $\langle f^* \rangle / l = -(u - a)$. Similarly, we find $\langle f^* \rangle / l = 0$ ($0 < u < a$ and $u < -a$), 
$\langle f^* \rangle / l = -u$ ($-a/2 < u < 0$), and $\langle f^* \rangle / l = -(u + a)$ ($-a < u < -a/2$). 

(II) $y = -1$ 

Using the relation $R = 1 - \varepsilon$, we find for $u > a$ 



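$$
\Omega(y|u) \simeq 1 - \frac{1}{u - a} \sqrt{\frac{\varepsilon}{\pi}}\, \exp\!\left[-\frac{(u - a)^2}{4\varepsilon}\right]. \tag{A4}
$$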



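Therefore, the weight function $\langle f^* \rangle / l$ is asymptotically zero. Similarly, we find 
$\langle f^* \rangle / l = a - u$ ($a/2 < u < a$), $\langle f^* \rangle / l = 0$ ($-a < u < 0$), 
$\langle f^* \rangle / l \simeq -u$ ($0 < u < a/2$), and $\langle f^* \rangle / l = -(a + u)$ ($u < -a$). 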
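
The limiting forms in (I) and (II) can be checked numerically. The sketch below is illustrative only: it assumes that $v$ given $u$ is Gaussian with mean $Ru$ and variance $1 - R^2$, that $T_a(v) = \mathrm{sign}[v(a-v)(a+v)]$, and that the Bayes-averaged weight can be written as $\langle f^* \rangle / l = [(1 - R^2)/R^2]\, \partial \ln \Omega(y|u) / \partial u$. This last form is an assumption of the sketch, adopted because it reproduces the limits quoted above. All helper names are hypothetical.

```python
import math


def H(x):
    """Gaussian upper-tail probability H(x) = P(t > x) for standard normal t."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def interval_prob(lo, hi, mean, sd):
    """P(lo < v < hi) for Gaussian v, written with small tails to avoid cancellation."""
    z_lo, z_hi = (lo - mean) / sd, (hi - mean) / sd
    if z_lo >= 0.0:
        return H(z_lo) - H(z_hi)   # interval lies in the upper tail
    return H(-z_hi) - H(-z_lo)     # lower-tail form: Phi(z_hi) - Phi(z_lo)


def omega(y, u, R, a):
    """Probability that the teacher output equals y, given the student potential u."""
    mean, sd = R * u, math.sqrt(1.0 - R * R)
    if y == 1:
        # T_a(v) = +1 for 0 < v < a and for v < -a.
        return interval_prob(0.0, a, mean, sd) + H((a + mean) / sd)
    # T_a(v) = -1 for -a < v < 0 and for v > a.
    return interval_prob(-a, 0.0, mean, sd) + H((a - mean) / sd)


def bayes_weight(y, u, R, a, step=1e-5):
    """Assumed form of <f*>/l: (1 - R^2)/R^2 times d ln(Omega)/du (central difference)."""
    dlog = (math.log(omega(y, u + step, R, a))
            - math.log(omega(y, u - step, R, a))) / (2.0 * step)
    return (1.0 - R * R) / (R * R) * dlog
```

For example, with $a = 1$ and $R = 0.999$, `bayes_weight(1, 1.5, 0.999, 1.0)` is close to $-(u - a) = -0.5$, and the values in the other regions approach the piecewise limits listed in (I) and (II) as $R \to 1$.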

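From the results of (I) and (II), we find the modified AdaTron learning algorithm as 

$$
\mathbf{J}^{m+1} = \mathbf{J}^{m} + \Theta(-T_a(v) S_a(u))\, h(u)\, l\, \mathbf{x}, \tag{A5}
$$

where 

$$
h(u) = \begin{cases}
a - u & (u > a/2) \\
-u & (-a/2 < u < a/2) \\
-a - u & (u < -a/2)
\end{cases}
\tag{A6}
$$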
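
A minimal sketch of one update of Eqs. (A5) and (A6) follows. It assumes inputs normalized to $|\mathbf{x}| = 1$, the definitions $u = \sqrt{N}\, \mathbf{J} \cdot \mathbf{x} / |\mathbf{J}|$ and $l = |\mathbf{J}| / \sqrt{N}$, and the reversed-wedge output $S_a(u) = \mathrm{sign}[u(a-u)(a+u)]$. These conventions and the function names are assumptions made for illustration.

```python
import math


def reversed_wedge(u, a):
    """Reversed-wedge output sign[u (a - u) (a + u)] (assumed definition)."""
    return 1 if u * (a - u) * (a + u) >= 0 else -1


def h(u, a):
    """Piecewise weight of Eq. (A6): signed distance to the nearest boundary 0, +a or -a."""
    if u > a / 2:
        return a - u
    if u < -a / 2:
        return -a - u
    return -u


def modified_adatron_step(J, x, y, a):
    """One step of Eq. (A5); y is the teacher output T_a(v) for the input x."""
    n = len(J)
    norm_J = math.sqrt(sum(j * j for j in J))
    # Student potential in the assumed normalization.
    u = math.sqrt(n) * sum(j * xi for j, xi in zip(J, x)) / norm_J
    if y * reversed_wedge(u, a) > 0:
        return list(J)             # Theta(-T S) = 0: outputs agree, no update
    l = norm_J / math.sqrt(n)
    f = h(u, a) * l                # weight applied to the input
    return [j + f * xi for j, xi in zip(J, x)]
```

With these conventions, a corrective step moves the potential of the presented example, measured in units of the old $l$, exactly onto the nearest of the decision boundaries $0$, $\pm a$. This generalizes the conventional AdaTron step, which moves it onto $u = 0$.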

FIGURES 



FIG. 1. Generalization error as a function of the overlap R for several values of a. The student 
should be trained so that the overlap goes to 1. 

FIG. 2. Trajectories of the $R$-$l$ flow for $a = \infty$. All $R$-$l$ flows converge to the state $R = 1$ 
after an infinite number of examples are presented. 

FIG. 3. Trajectories of the $R$-$l$ flow for $a = 0$. All $R$-$l$ flows converge to the state $R = -1$. 
Therefore, the corresponding generalization error does not converge to the ideal value of zero in 
this case. 

FIG. 4. Trajectories for the conventional AdaTron learning. Except for the cases of $a = \infty$ 
and $a = 0$ (overlapping), the trajectories converge to the state $l = 0$. 

FIG. 5. Learning curves corresponding to Fig. 4. For the two cases of $a = \infty$ and $a = 0$ 
(overlapping), the generalization errors converge to zero as $\alpha^{-1}$. However, for the other cases, 
the generalization errors converge exponentially to finite values. 

FIG. 6. Shapes of $\Omega(+1|u)$ for $R = 0.5$. 

FIG. 7. Shapes of $\Omega(+1|u)$ for $R = 0.8$. We see that for any $a$, $\Omega(+1|u)$ seems to approach 
$(T_a(u) + 1)/2$ as $R$ goes to $+1$. 

FIG. 8. Learning curves of the perceptron, optimized perceptron and Bayesian optimization 
algorithms for $a = \infty$. The Bayesian optimization algorithm is the best among the three. 

FIG. 9. Learning curves of the perceptron, optimized perceptron and Bayesian optimization 
algorithms for $a = 1.0$. The Bayesian optimization algorithm gives the best result among the three. 

FIG. 10. Distributions of the gain $E_a(R, u)$ for $a = 2.0$ (top) and $a = 0.8$ (bottom). The peak 
positions give the best places to make queries. 